arXiv:1505.06798v2 [cs.CV] 18 Nov 2015 




Accelerating Very Deep Convolutional 
Networks for Classification and Detection 

Xiangyu Zhang, Jianhua Zou, Kaiming He and Jian Sun


Abstract —This paper aims to accelerate the test-time computation of convolutional neural networks (CNNs), especially very 
deep CNNs [1] that have substantially impacted the computer vision community. Unlike previous methods that are designed 
for approximating linear filters or linear responses, our method takes the nonlinear units into account. We develop an effective 
solution to the resulting nonlinear optimization problem without the need of stochastic gradient descent (SGD). More importantly, 
while previous methods mainly focus on optimizing one or two layers, our nonlinear method enables an asymmetric reconstruction 
that reduces the rapidly accumulated error when multiple (e.g., >10) layers are approximated. For the widely used very deep
VGG-16 model [1], our method achieves a whole-model speedup of 4x with merely a 0.3% increase of top-5 error in ImageNet 
classification. Our 4x accelerated VGG-16 model also shows a graceful accuracy degradation for object detection when plugged 
into the Fast R-CNN detector [2]. 

Index Terms —Convolutional Neural Networks, Acceleration, Image Classification, Object Detection 

1 Introduction

The accuracy of convolutional neural networks 
(CNNs) [3], [4] has been continuously improving [5], 
[6], [7], [1], [8], but the computational cost of these 
networks also increases significantly. For example, the
very deep VGG models [1], which have witnessed 
great success in a wide range of recognition tasks [9], 
[2], [10], [11], [12], [13], [14], are substantially slower 
than earlier models [4], [5]. Real-world systems may 
suffer from the low speed of these networks. For 
example, a cloud service needs to process thousands 
of new requests per second; portable devices such as
phones and tablets may not afford slow models; some 
recognition tasks like object detection [7], [2], [10], 
[11] and semantic segmentation [12], [13], [14] need 
to apply these models on higher-resolution images. It 
is thus of practical importance to accelerate test-time 
performance of CNNs. 

There have been a series of studies on accelerating 
deep CNNs [15], [16], [17], [18]. A common focus 
of these methods is on the decomposition of one or 
a few layers. These methods have shown promising 
speedup ratios and accuracy on one or two layers and 
whole (but shallower) models. However, few results
are available for accelerating very deep models (e.g.,
>10 layers). Experiments on complex datasets such
as ImageNet [19] are also limited; e.g., the results in
[16], [17], [18] are about accelerating a single layer of
the shallower AlexNet [4]. Moreover, the performance of
the accelerated networks as generic feature extractors
for other recognition tasks [2], [12] remains unclear.

It is nontrivial to speed up whole, very deep models
for complex tasks like ImageNet classification. Acceleration
algorithms involve not only the decomposition
of layers, but also the optimization solutions to the
decomposition. Data (response) reconstruction solvers
[17] based on stochastic gradient descent (SGD) and
backpropagation work well for simpler tasks such
as character classification [17], but are less effective
for complex ImageNet models (as we will discuss
in Sec. 4). These SGD-based solvers are sensitive to
initialization and learning rates, and might be trapped
in poorer local optima for regressing responses.
Moreover, even when a solver manages to accelerate
a single layer, the accumulated error of approximating
multiple layers grows rapidly, especially for very deep
models. Besides, the layers of a very deep model
may exhibit a great diversity in filter numbers, feature
map sizes, sparsity, and redundancy. It may not be
beneficial to uniformly accelerate all layers.

In this paper, we present an acceleration method
that is effective for very deep models. We first propose
a response reconstruction method that takes
into account the nonlinear neurons and a low-rank
constraint. A solution based on Generalized Singular
Value Decomposition (GSVD) is developed for this
nonlinear problem, without the need of SGD. Our
explicit treatment of the nonlinearity better models
a nonlinear layer, and more importantly, enables an
asymmetric reconstruction that accounts for the error
from previously approximated layers. This method
effectively reduces the accumulated error when multiple
layers are approximated sequentially. We also
present a rank selection method for adaptively determining
the acceleration of each layer for a whole
model, based on their redundancy.

In experiments, we demonstrate the effects of the
nonlinear solution, asymmetric reconstruction, and
whole-model acceleration by controlled experiments
of a 10-layer model on ImageNet classification [19].
Furthermore, we apply our method on the publicly
available VGG-16 model [1], and achieve a 4x
speedup with merely a 0.3% increase of top-5 center-view
error.

The impact of the ImageNet dataset [19] is not
merely on the specific 1000-class classification task;
deep models pre-trained on ImageNet have been actively
used to replace hand-engineered features, and
have showcased excellent accuracy for challenging
tasks such as object detection [9], [2], [10], [11] and
semantic segmentation [12], [13], [14]. We exploit our
method to accelerate the very deep VGG-16 model for
Fast R-CNN [2] object detection. With a 4x speedup
of all convolutions, our method has a graceful degradation
of 0.8% mAP (from 66.9% to 66.1%) on the
PASCAL VOC 2007 detection benchmark [20].

A preliminary version of this manuscript has been
presented in a conference [21]. This manuscript extends
the initial version in several aspects to
strengthen our method. (1) We demonstrate compelling
acceleration results on very deep VGG models,
and are among the first few works accelerating very
deep models. (2) We investigate the accelerated models
for transfer-learning-based object detection [9],
[2], which is one of the most important applications
of ImageNet pre-trained networks. (3) We provide
evidence showing that a model trained from scratch
and sharing the same structure as the accelerated
model is inferior. This discovery suggests that a very
deep model can be accelerated not simply because the
decomposed network architecture is more powerful,
but because the acceleration optimization algorithm is
able to digest information.

2 Related Work 

Methods [15], [16], [17], [18] for accelerating test-time
computation of CNNs in general have two
components: (i) a layer decomposition design that
reduces time complexity, and (ii) an optimization
scheme for the decomposition design. Although the
former ("decomposition") attracts more attention because
it directly addresses the time complexity, the
latter ("optimization") is also essential because not all
decompositions make it equally easy to find good local
optima.

The method of Denton et al. [16] is one of the
first to exploit low-rank decompositions of filters.
Several decomposition designs along different dimensions
have been investigated. This method does not
explicitly minimize the error of the activations after
the nonlinearity, which is influential to the accuracy
as we will show. This method presents experiments
of accelerating a single layer of an OverFeat network
[6], but no whole-model results are available.

Jaderberg et al. [17] present efficient decompositions
by separating $k \times k$ filters into $k \times 1$ and $1 \times k$ filters,
which was earlier developed for accelerating generic
image filters [22]. Channel-wise dimension reduction
is also considered. Two optimization schemes are proposed:
(i) "filter reconstruction" that minimizes the error
of filter weights, and (ii) "data reconstruction" that
minimizes the error of responses. In [17], conjugate
gradient descent is used to solve filter reconstruction,
and SGD with backpropagation is used to solve data
reconstruction. Data reconstruction in [17] demonstrates
excellent performance on a character classification
task using a 4-layer network. For ImageNet
classification, their paper evaluates a single layer of an
OverFeat network by "filter reconstruction". But the
performance of whole, very deep models in ImageNet
remains unclear.

Concurrent with our work, Lebedev et al. [18] adopt
"CP-decomposition" to decompose a layer into five
layers of lower complexity. For ImageNet classification,
only a single-layer acceleration of AlexNet is
reported in [18]. Moreover, Lebedev et al. report that
they "failed to find a good SGD learning rate" in their
fine-tuning, suggesting that it is nontrivial to optimize
the factorization for even a single layer in ImageNet
models.

Despite some promising preliminary results that
have been obtained in the above works [16], [17], [18],
the whole-model acceleration of very deep networks for
ImageNet is still an open problem.

Besides the research on decomposing layers, there
have been other streams on improving train/test-time
performance of CNNs. FFT-based algorithms [23], [24]
are applicable for both training and testing, and are
particularly effective for large spatial kernels. On the
other hand, it is also proposed to train "thin" and
deep networks [25], [26] for good trade-off between
speed and accuracy. Besides reducing running time, a 
related issue involving memory conservation [27] has 
also attracted attention. 

3 Approaches 

Our method exploits a low-rank assumption for decomposition,
following the stream of [16], [17]. We
show that this decomposition has a closed-form solution
(SVD) for linear neurons, and a slightly more
complicated solution (GSVD [28], [29], [30]) for nonlinear
neurons. The simplicity of our solver enables
an asymmetric reconstruction method for reducing
accumulated error of very deep models.

3.1 Low-rank Approximation of Responses 

Our assumption is that the filter response at a pixel of
a layer approximately lies on a low-rank subspace. A
resulting low-rank decomposition reduces time complexity.
To find the approximate low-rank subspace,
we minimize the reconstruction error of the responses.

Figure 1: Illustration of the decomposition. (a) An
original layer with complexity $O(dk^2c)$. (b) An
approximated layer with complexity reduced to
$O(d'k^2c) + O(dd')$.

More formally, we consider a convolutional layer
with a filter size of $k \times k \times c$, where $k$ is the spatial size
of the filter and $c$ is the number of input channels of
this layer. To compute a response, this filter is applied
on a $k \times k \times c$ volume of the layer input. We use
$x \in \mathbb{R}^{k^2c+1}$ to denote a vector that reshapes this volume,
where we append one as the last entry for the sake of
the bias. A response $y \in \mathbb{R}^{d}$ at a position of a layer is
computed as:

$$y = Wx, \qquad (1)$$

where $W$ is a $d$-by-$(k^2c+1)$ matrix, and $d$ is the number
of filters. Each row of $W$ denotes the reshaped form
of a $k \times k \times c$ filter with the bias appended.

Under the assumption that the vector $y$ is on a low-rank
subspace, we can write $y = M(y - \bar{y}) + \bar{y}$, where
$M$ is a $d$-by-$d$ matrix of a rank $d' < d$ and $\bar{y}$ is the
mean vector of responses. Expanding this equation,
we can compute a response by:

$$y = MWx + b, \qquad (2)$$

where $b = \bar{y} - M\bar{y}$ is a new bias. The rank-$d'$ matrix $M$
can be decomposed into two $d$-by-$d'$ matrices $P$ and
$Q$ such that $M = PQ^\top$. We denote $W' = Q^\top W$ as a
$d'$-by-$(k^2c+1)$ matrix, which is essentially a new set of
$d'$ filters. Then we can compute (2) by:

$$y = PW'x + b. \qquad (3)$$

The complexity of using Eqn.(3) is $O(d'k^2c) + O(dd')$,
while the complexity of using Eqn.(1) is $O(dk^2c)$.
For many typical models/layers, we usually have
$O(dd') \ll O(d'k^2c)$, so the computation in Eqn.(3) will
reduce the complexity to about $d'/d$.
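
The complexity comparison above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the bias term is ignored, and cost is counted as multiply-adds per output position. The layer sizes are taken from the SPP-10 model described in Table 1.

```python
def layer_cost(d, k, c):
    """Multiply-adds per output position for d filters of size k x k x c (Eqn. 1)."""
    return d * k * k * c


def decomposed_cost(d, d_prime, k, c):
    """Cost of Eqn. (3): d' filters of size k x k x c, then d filters of size 1 x 1 x d'."""
    return d_prime * k * k * c + d * d_prime


def theoretical_speedup(d, d_prime, k, c):
    """Ratio of the original cost to the decomposed cost."""
    return layer_cost(d, k, c) / decomposed_cost(d, d_prime, k, c)


# Conv1 of SPP-10: 7 x 7 filters, c = 3 input channels, d = 96 filters, d' = 32.
conv1_speedup = theoretical_speedup(d=96, d_prime=32, k=7, c=3)
# Conv4 of SPP-10: 3 x 3 filters, c = 512, d = 512, halved to d' = 256.
conv4_speedup = theoretical_speedup(d=512, d_prime=256, k=3, c=512)

print(round(conv1_speedup, 2))  # about 1.81
print(round(conv4_speedup, 2))  # 1.8
```

For Conv1 the $O(dd')$ term is not negligible because $c = 3$ is small, which is why $d' = 32$ (one third of the filters) yields only about a 1.8x speedup rather than 3x.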

Fig. 1 illustrates how to use Eqn.(3) in a network.
We replace the original layer (given by $W$) by two
layers (given by $W'$ and $P$). The matrix $W'$ is actually
$d'$ filters whose sizes are $k \times k \times c$. These filters
produce a $d'$-dimensional feature map. On this feature
map, the $d$-by-$d'$ matrix $P$ can be implemented as $d$
filters whose sizes are $1 \times 1 \times d'$. So $P$ corresponds
to a convolutional layer with a $1 \times 1$ spatial support,
which maps the $d'$-dimensional feature map to a
$d$-dimensional one.

Note that the decomposition of $M = PQ^\top$ can be
arbitrary. It does not impact the value of $y$ computed
in Eqn.(3). A simple decomposition is the Singular
Value Decomposition (SVD) [31]: $M = U_{d'} S_{d'} V_{d'}^\top$,
where $U_{d'}$ and $V_{d'}$ are $d$-by-$d'$ column-orthogonal
matrices and $S_{d'}$ is a $d'$-by-$d'$ diagonal matrix. Then
we can obtain $P = U_{d'} S_{d'}^{1/2}$ and $Q = V_{d'} S_{d'}^{1/2}$.

In practice the low-rank assumption does not
strictly hold, and the computation in Eqn.(3) is approximate.
To find an approximate low-rank subspace,
we optimize the following problem:

$$\min_{M} \sum_i \| (y_i - \bar{y}) - M (y_i - \bar{y}) \|_2^2, \quad \text{s.t. } \operatorname{rank}(M) \le d'. \qquad (4)$$

Here $y_i$ is a response sampled from the feature maps
in the training set. This problem can be solved by
SVD [31] or actually Principal Component Analysis
(PCA): let $Y$ be the $d$-by-$n$ matrix concatenating $n$
responses with the mean subtracted, compute the
eigen-decomposition of the covariance matrix $YY^\top = USU^\top$
where $U$ is an orthogonal matrix and $S$ is
diagonal, and $M = U_{d'} U_{d'}^\top$ where $U_{d'}$ are the first
$d'$ eigenvectors. With the matrix $M$ computed, we can
find $P = Q = U_{d'}$.
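
The PCA solution to (4) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `linear_low_rank` and the synthetic responses are assumptions made for the example, not part of the paper.

```python
import numpy as np


def linear_low_rank(responses, d_prime):
    """Solve Eqn. (4) by PCA. `responses` is d x n; returns (M, P, Q, b)."""
    y_mean = responses.mean(axis=1, keepdims=True)   # mean response, d x 1
    Y = responses - y_mean                           # mean-subtracted responses
    eigvals, eigvecs = np.linalg.eigh(Y @ Y.T)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d_prime]        # indices of the d' largest
    U_d = eigvecs[:, top]                            # d x d' leading eigenvectors
    M = U_d @ U_d.T                                  # rank-d' projection matrix
    b = y_mean - M @ y_mean                          # new bias of Eqn. (2)
    return M, U_d, U_d, b


# Hypothetical responses that lie exactly on a 4-dimensional affine subspace.
rng = np.random.default_rng(0)
d, n, true_rank = 8, 200, 4
responses = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, n)) + 1.0

M, P, Q, b = linear_low_rank(responses, d_prime=4)
approx = P @ (Q.T @ responses) + b                   # Eqn. (3) applied to y = Wx
max_err = float(np.abs(approx - responses).max())
print(np.linalg.matrix_rank(M), max_err < 1e-8)
```

Because the synthetic responses are exactly low-rank, the reconstruction is exact up to rounding; on real feature maps the error would be governed by the discarded eigenvalues.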

How good is the low-rank assumption? We sample
the responses from a CNN model (with 7 convolutional
layers, detailed in Sec. 4) trained on ImageNet.
For the responses of each layer, we compute the
eigenvalues of their covariance matrix and then plot
the sum of the largest eigenvalues (Fig. 2). We see
that substantial energy is in a small portion of the
largest eigenvectors. For example, in the Conv2 layer
($d = 256$) the first 128 eigenvectors contribute over
99.9% energy; in the Conv7 layer ($d = 512$), the first
256 eigenvectors contribute over 95% energy. This
indicates that we can use a fraction of the filters to
precisely approximate the original filters.

The low-rank behavior of the responses $y$ is because
of the low-rank behaviors of the filter weights $W$
and the inputs $x$. Although the low-rank assumptions
about filter weights $W$ have been adopted in recent
work [16], [17], we further adopt the low-rank assumptions
about the filter inputs $x$, which are local
volumes and have correlations. The responses $y$ will
have lower rank than $W$ and $x$, so the approximation
can be more precise. In our optimization (4), we
directly address the low-rank subspace of $y$.

Figure 2: PCA accumulative energy of the responses in each layer, presented as the sum of largest $d'$ eigenvalues
(relative to the total energy when $d' = d$). Here the filter number $d$ is 96 for Conv1, 256 for Conv2, and 512
for Conv3-7 (detailed in Table 1). These figures are obtained from 3,000 randomly sampled training images.


3.2 Nonlinear Case 

Next we investigate the case of using nonlinear units.
We use $r(\cdot)$ to denote the nonlinear operator. In this
paper we focus on the Rectified Linear Unit (ReLU)
[32]: $r(\cdot) = \max(\cdot, 0)$.

Driven by Eqn.(4), we minimize the reconstruction
error of the nonlinear responses:

$$\min_{M, b} \sum_i \| r(y_i) - r(M y_i + b) \|_2^2, \quad \text{s.t. } \operatorname{rank}(M) \le d'. \qquad (5)$$

Here $b$ is a new bias to be optimized, and $r(My + b) = r(MWx + b)$
is the nonlinear response computed by
the approximated filters.

The above optimization problem is challenging due
to the nonlinearity and the low-rank constraint. To
find a feasible solution, we relax it as:

$$\min_{M, b, \{z_i\}} \sum_i \| r(y_i) - r(z_i) \|_2^2 + \lambda \| z_i - (M y_i + b) \|_2^2, \quad \text{s.t. } \operatorname{rank}(M) \le d'. \qquad (6)$$

Here $\{z_i\}$ is a set of auxiliary variables of the same
size as $\{y_i\}$. $\lambda$ is a penalty parameter. If $\lambda \to \infty$, the
solution to (6) will converge to the solution to (5) [33].
We adopt an alternating solver, fixing $\{z_i\}$ and solving
for $M$, $b$ and vice versa.

(i) The subproblem of $M$, $b$. In this case, $\{z_i\}$ are
fixed. It is easy to show that $b$ is solved by $b = \bar{z} - M\bar{y}$,
where $\bar{z}$ is the mean vector of $\{z_i\}$. Substituting
$b$ into the objective function, we obtain the problem
involving $M$:

$$\min_{M} \sum_i \| (z_i - \bar{z}) - M (y_i - \bar{y}) \|_2^2, \quad \text{s.t. } \operatorname{rank}(M) \le d'. \qquad (7)$$

This problem appears similar to Eqn.(4) except that
there are two sets of responses.

This optimization problem also has a closed-form
solution by Generalized SVD (GSVD) [28], [29], [30].
Let $Z$ be the $d$-by-$n$ matrix concatenating the vectors
of $\{z_i - \bar{z}\}$. We rewrite the above problem as:

$$\min_{M} \| Z - MY \|_F^2, \quad \text{s.t. } \operatorname{rank}(M) \le d'. \qquad (8)$$

Here $\| \cdot \|_F$ is the Frobenius norm. A problem in
this form is known as Reduced Rank Regression
[28], [29], [30]. This problem belongs to a broader
category of procrustes problems [28] that have been
adopted for various data reconstruction problems
[34], [35], [36]. The solution is as follows (see [30]).
Let $\hat{M} = ZY^\top (YY^\top)^{-1}$. GSVD [30] is applied on
$\hat{M}$: $\hat{M} = USV^\top$, such that $U$ is a $d$-by-$d$ orthogonal
matrix satisfying $U^\top U = I_d$ where $I_d$ is a $d$-by-$d$
identity matrix, and $V$ is a $d$-by-$d$ matrix satisfying
$V^\top YY^\top V = I_d$ (called generalized orthogonality). Then
the solution $M$ to (8) is given by $M = U_{d'} S_{d'} V_{d'}^\top$,
where $U_{d'}$ and $V_{d'}$ are the first $d'$ columns of $U$ and
$V$ and $S_{d'}$ are the largest $d'$ singular values. One can
show that if $Z = Y$ (so the problem in (7) becomes
(4)), this GSVD solution becomes SVD, i.e., eigen-decomposition
of $YY^\top$.
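
For illustration, the reduced rank regression problem (8) can also be solved with an ordinary SVD. The sketch below is not the GSVD routine described above: it uses the textbook projection form (least-squares fit followed by a rank-$d'$ projection of the fitted responses), which should reach the same minimum when $YY^\top$ is invertible. The function name and the synthetic data are assumptions.

```python
import numpy as np


def reduced_rank_regression(Z, Y, d_prime):
    """Minimise ||Z - M Y||_F over d x d matrices M with rank(M) <= d_prime."""
    # Unconstrained least-squares solution M_hat = Z Y^T (Y Y^T)^{-1}.
    M_hat = np.linalg.solve(Y @ Y.T, Y @ Z.T).T
    # Best rank-d' approximation of the fitted responses M_hat Y (Eckart-Young).
    U, _, _ = np.linalg.svd(M_hat @ Y, full_matrices=False)
    U_d = U[:, :d_prime]
    return U_d @ U_d.T @ M_hat


# Hypothetical mean-subtracted responses Y and auxiliary variables Z.
rng = np.random.default_rng(0)
d, n = 6, 500
Y = rng.normal(size=(d, n))
Y -= Y.mean(axis=1, keepdims=True)
Z = rng.normal(size=(d, d)) @ Y + 0.1 * rng.normal(size=(d, n))
Z -= Z.mean(axis=1, keepdims=True)

M = reduced_rank_regression(Z, Y, d_prime=3)
M_full = reduced_rank_regression(Z, Y, d_prime=d)   # no rank reduction
M_pca = reduced_rank_regression(Y, Y, d_prime=3)    # special case Z = Y
err = float(np.linalg.norm(Z - M @ Y))
err_full = float(np.linalg.norm(Z - M_full @ Y))
print(np.linalg.matrix_rank(M), err_full <= err)
```

When `Z` equals `Y`, the least-squares fit is the identity and the routine reduces to the PCA projection of Eqn.(4), mirroring the remark at the end of the paragraph above.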

(ii) The subproblem of $\{z_i\}$. In this case, $M$ and $b$
are fixed. Then in this subproblem each element $z_{ij}$ of
each vector $z_i$ is independent of any other. So we solve
a 1-dimensional optimization problem as follows:

$$\min_{z_{ij}} \left( r(y_{ij}) - r(z_{ij}) \right)^2 + \lambda \left( z_{ij} - y'_{ij} \right)^2, \qquad (9)$$

where $y'_{ij}$ is the $j$-th entry of $M y_i + b$. By separately
considering $z_{ij} \ge 0$ and $z_{ij} < 0$, we obtain the solution
as follows: let

$$z_0 = \min(0, y'_{ij}), \qquad (10)$$

$$z_1 = \max\left(0, \frac{\lambda \cdot y'_{ij} + r(y_{ij})}{\lambda + 1}\right), \qquad (11)$$

then $z_{ij} = \arg\min_{z \in \{z_0, z_1\}} \left( r(y_{ij}) - r(z) \right)^2 + \lambda \left( z - y'_{ij} \right)^2$.
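
The element-wise update of (9)-(11) for ReLU is simple enough to write out directly. The following is a minimal scalar sketch; the function names are hypothetical, and a practical implementation would vectorize the same formulas over all entries.

```python
def relu(v):
    """Rectified Linear Unit r(v) = max(v, 0)."""
    return max(v, 0.0)


def objective(z, y, y_prime, lam):
    """Objective of Eqn. (9) for a single entry."""
    return (relu(y) - relu(z)) ** 2 + lam * (z - y_prime) ** 2


def solve_z(y, y_prime, lam):
    """Closed-form minimiser of Eqn. (9) using the candidates (10) and (11)."""
    z0 = min(0.0, y_prime)                                   # best z on the branch z <= 0
    z1 = max(0.0, (lam * y_prime + relu(y)) / (lam + 1.0))   # best z on the branch z >= 0
    return min((z0, z1), key=lambda z: objective(z, y, y_prime, lam))


# y is the original response, y_prime the entry of M y + b, lam the penalty.
z_a = solve_z(y=2.0, y_prime=1.0, lam=1.0)    # positive branch wins
z_b = solve_z(y=2.0, y_prime=-1.0, lam=1.0)   # non-positive branch wins
print(z_a, z_b)
```

On each branch the objective is a convex quadratic in $z$, so clipping the unconstrained minimiser to the branch gives the branch optimum, and comparing the two candidates gives the global one.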

Our method is also applicable for other types of nonlinearities.
The subproblem in (9) is a 1-dimensional
nonlinear least squares problem, so it can be solved by
gradient descent for other $r(\cdot)$.

We alternately solve (i) and (ii). The initialization
is given by the solution to the linear case (4). We warm
up the solver by setting the penalty parameter $\lambda = 0.01$
and running 25 iterations. Then we increase the value
of $\lambda$. In theory $\lambda$ should be gradually increased to
infinity [33]. But we find that it is difficult for the
iterative solver to make progress if $\lambda$ is too large. So
we increase $\lambda$ to 1, run 25 more iterations, and use
the resulting $M$ as our solution. As before, we obtain
$P$ and $Q$ by SVD on $M$.

In experiments, we find that it is sufficient to randomly
sample 3,000 images to solve Eqn.(5). It only
takes our method 2-5 minutes in MATLAB to solve a
layer. This is much faster than SGD-based solvers.

3.3 Asymmetric Reconstruction for Multi-Layer

When each layer is approximated independently, the
error of shallower layers will be rapidly accumulated
and affect deeper layers. We propose an asymmetric
reconstruction method to alleviate this problem.

We apply our method sequentially on each layer,
from the shallower layers to the deeper ones. Let
us consider a layer whose input feature map is not
precise due to the approximation of the previous
layer/layers. We denote the approximate input to the
current layer as $\hat{x}$. For the training data, we can still
compute its non-approximate responses as $y = Wx$.
So we can optimize an "asymmetric" version of (5):

$$\min_{M, b} \sum_i \| r(W x_i) - r(M W \hat{x}_i + b) \|_2^2, \quad \text{s.t. } \operatorname{rank}(M) \le d'. \qquad (12)$$

In the first term $r(Wx_i) = r(y_i)$ is the non-approximate
output of this layer. In the second term, $\hat{x}_i$ is the
approximated input to this layer, and $r(MW\hat{x}_i + b)$
is the approximated output of this layer. In contrast
to using $x_i$ (or $\hat{x}_i$) for both terms, this asymmetric
formulation faithfully incorporates the two actual
terms before/after the approximation of this layer.
The optimization problem in (12) can be solved using
the same algorithm as for (5).

3.4 Rank Selection for Whole-Model Acceleration 

In the above, the optimization is based on a target $d'$
of each layer. $d'$ is the only parameter that determines
the complexity of an accelerated layer. But given a
desired speedup ratio of the whole model, we need to
determine the proper rank $d'$ used for each layer. One
may adopt a uniform speedup ratio for each layer. But
this is not an optimal solution, because the layers are
not equally redundant.



Figure 3: PCA accumulative energy and the accuracy
rates (top-5). Here the accuracy is evaluated using the
linear solution (the nonlinear solution has a similar
trend). Each layer is evaluated independently, with
other layers not approximated. The accuracy is shown
as the difference to no approximation.

We empirically observe that the PCA energy after 
approximations is roughly related to the classification 
accuracy. To verify this observation, in Fig. 3 we 
show the classification accuracy (represented as the 
difference to no approximation) vs. the PCA energy. 
Each point in this figure is empirically evaluated 
using a reduced rank d'. 100% energy means no approximation
and thus no degradation of classification
accuracy. Fig. 3 shows that the classification accuracy 
is roughly linear on the PCA energy. 

To simultaneously determine the reduced ranks of
all layers, we further assume that the whole-model
classification accuracy is roughly related to the product
of the PCA energy of all layers. More formally, we
consider this objective function:

$$\mathcal{E} = \prod_{l} \sum_{a=1}^{d'_l} \sigma_{l,a}. \qquad (13)$$

Here $\sigma_{l,a}$ is the $a$-th largest eigenvalue of the layer
$l$, and $\sum_{a=1}^{d'_l} \sigma_{l,a}$ is the PCA energy of the largest $d'_l$
eigenvalues in the layer $l$. The product $\prod_l$ is over all
layers to be approximated. The objective $\mathcal{E}$ is assumed
to be related to the accuracy of the approximated
whole network. Then we optimize this problem:

$$\max_{\{d'_l\}} \mathcal{E}, \quad \text{s.t. } \sum_{l} \frac{d'_l}{d_l} C_l \le C. \qquad (14)$$

Here $d_l$ is the original number of filters in the layer $l$,
and $C_l$ is the original time complexity of the layer $l$.
So $\frac{d'_l}{d_l} C_l$ is the complexity after the approximation. $C$
is the total complexity after the approximation, which
layer   filter size   # channels   # filters   stride   output size   complexity (%)   # of zeros
Conv1   7 x 7         3            96          2        109 x 109     3.8              0.49
Pool1   3 x 3         -            -           3        37 x 37       -                -
Conv2   5 x 5         96           256         1        35 x 35       17.3             0.62
Pool2   2 x 2         -            -           2        18 x 18       -                -
Conv3   3 x 3         256          512         1        18 x 18       8.8              0.60
Conv4   3 x 3         512          512         1        18 x 18       17.5             0.69
Conv5   3 x 3         512          512         1        18 x 18       17.5             0.69
Conv6   3 x 3         512          512         1        18 x 18       17.5             0.68
Conv7   3 x 3         512          512         1        18 x 18       17.5             0.95


Table 1: The architecture of the SPP-10 model [7]. It has 7 conv layers and 3 fc layers. Each layer (except the
last fc) is followed by ReLU. The final conv layer is followed by a spatial pyramid pooling layer [7] that has
4 levels ({6 x 6, 3 x 3, 2 x 2, 1 x 1}, totally 50 bins). The resulting 50 x 512-d is fed into the 4096-d fc layer
(fc6), followed by another 4096-d fc layer (fc7) and a 1000-way softmax layer. The column "complexity" is the
theoretical time complexity, shown as relative numbers to the total convolutional complexity. The column "#
of zeros" is the relative portion of zero responses, which shows the "sparsity" of the layer.


is given by the desired speedup ratio. This optimization
problem means that we want to maximize the
accumulated energy subject to the time complexity
constraint.

The problem in (14) is a combinatorial problem
[37]. So we adopt a greedy strategy to solve it. We
initialize $d'_l$ as $d_l$, and consider the set $\{\sigma_{l,a}\}$. In each
step we remove an eigenvalue $\sigma_{l,d'_l}$ from this set,
chosen from a certain layer $l$. The relative reduction
of the objective is $\Delta\mathcal{E}/\mathcal{E} = \sigma_{l,d'_l} / \sum_{a=1}^{d'_l} \sigma_{l,a}$, and the
reduction of complexity is $\Delta C = \frac{1}{d_l} C_l$. Then we
define a measure as $\frac{\Delta\mathcal{E}/\mathcal{E}}{\Delta C}$. The eigenvalue $\sigma_{l,d'_l}$ that
has the smallest value of this measure is removed.
Intuitively, this measure favors a small reduction of
$\Delta\mathcal{E}/\mathcal{E}$ and a large reduction of complexity $\Delta C$. This
step is greedily iterated, until the constraint of the
total complexity is achieved.
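
The greedy strategy can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the toy eigenvalues and the guard that keeps at least one filter per layer are not from the paper.

```python
def greedy_rank_selection(eigenvalues, costs, budget):
    """Greedy solver for (14).

    eigenvalues[l]: eigenvalues of layer l, sorted in decreasing order.
    costs[l]: original complexity C_l of layer l.
    budget: target total complexity C after approximation.
    Returns the selected rank d'_l for each layer.
    """
    ranks = [len(eig) for eig in eigenvalues]      # d'_l initialised to d_l
    energy = [sum(eig) for eig in eigenvalues]     # PCA energy of the kept eigenvalues

    def total_cost():
        return sum(r / len(eig) * c for r, eig, c in zip(ranks, eigenvalues, costs))

    while total_cost() > budget:
        best_layer, best_score = None, float("inf")
        for l, eig in enumerate(eigenvalues):
            if ranks[l] <= 1:                      # assumption: keep at least one filter
                continue
            rel_energy_drop = eig[ranks[l] - 1] / energy[l]   # relative drop of the objective
            cost_drop = costs[l] / len(eig)                   # complexity saved, C_l / d_l
            score = rel_energy_drop / cost_drop
            if score < best_score:
                best_layer, best_score = l, score
        if best_layer is None:                     # nothing left to remove
            break
        energy[best_layer] -= eigenvalues[best_layer][ranks[best_layer] - 1]
        ranks[best_layer] -= 1
    return ranks


# Toy example: the first layer is far more redundant than the second.
toy_eigenvalues = [[10.0, 0.5, 0.3, 0.2], [4.0, 3.0, 2.0, 1.0]]
selected = greedy_rank_selection(toy_eigenvalues, costs=[100.0, 100.0], budget=100.0)
print(selected)  # [1, 3]
```

In the toy example both layers have the same cost, yet a 2x whole-model speedup is reached by cutting the redundant layer to one filter and the other layer to three, rather than halving both.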

3.5 Higher-Dimensional Decomposition 

In our formulation, we focus on reducing the channels
(from $d$ to $d'$). There are algorithmic advantages of operating
on the channel dimension. Firstly, this dimension
can be easily controlled by the rank constraint
$\operatorname{rank}(M) \le d'$. This constraint enables closed-form
solutions, e.g., SVD or GSVD. Secondly, the optimized
low-rank projection $M$ can be exactly decomposed
into low-dimensional filters ($P$ and $Q$). These simple
and closed-form solutions can produce good results
using a very small subset of training images (3,000
out of one million).

On the other hand, compared with decomposition
methods that operate on multiple dimensions (spatial
and channel) [17], our method has to use a smaller
$d'$ to approach a given speedup ratio, which might
limit the accuracy of our method. To avoid $d'$ being
too small, we further propose to combine our solver
with Jaderberg et al.'s spatial decomposition. Thanks
to our asymmetric reconstruction, our method can
effectively alleviate the accumulated error for the
multi-decomposition.

To determine the decomposed architecture (but
not yet the weights), we first use our method to
decompose all conv layers of a model. This involves
the rank selection of $d'$ for all layers. Then we apply
Jaderberg et al.'s method to further decompose the
resulting $k \times k$ layers ($k > 1$) into $k \times 1$ and $1 \times k$ filters.
The first $k \times 1$ layer has $d''$ output channels depending
on the speedup ratio. In this way, an original layer of
($k \times k$, $d$) is decomposed into three layers of ($k \times 1$,
$d''$), ($1 \times k$, $d'$), and ($1 \times 1$, $d$). For a speedup ratio $r$,
we let each method contribute a speedup of $\sqrt{r}$.

With the decomposed architecture determined, we
solve for the weights of the decomposed layers. Given
their order as above, we first optimize the ($k \times 1$,
$d''$) and ($1 \times k$, $d$) layers using "filter reconstruction"
[17] (we will discuss "data reconstruction" later). Then
we adopt our solution on the ($1 \times k$, $d$) layer and
optimize for the ($1 \times k$, $d'$) and ($1 \times 1$, $d$) layers. We
use our asymmetric reconstruction in Eqn.(12). In the
$r(MW\hat{x} + b)$ term, $\hat{x}$ is the approximated input to this
$1 \times k$ layer, and the $r(Wx) = r(y)$ term is still the
true response of the original $k \times k$ layer without any
decomposition. The approximation error of the spatial
decomposition will also be addressed by our asymmetric
reconstruction, which is important to alleviate
accumulated error. We term this as "asymmetric (3d)"
in the following.

3.6 Fine-tuning 

With any approximated whole model, we may "fine-tune"
this model end-to-end on the ImageNet training
data. This process is similar to training a classification
network with the approximated model as the initialization.

Figure 4: Linear vs. Nonlinear for SPP-10: single-layer performance of accelerating Conv1 to Conv7. The
speedup ratios are computed by the theoretical complexity of that layer. The error rates are top-5 single-view,
and shown as the increase of error rates compared with no approximation (smaller is better).


However, we empirically find that fine-tuning is
very sensitive to the initialization (given by the approximated
model) and the learning rate. If the initialization
is poor and the learning rate is small, the fine-tuning
is easily trapped in a poor local optimum and
makes little progress. If the learning rate is large, the
fine-tuning process behaves very similar to training
the decomposed architecture "from scratch" (as we
will discuss later). A large learning rate may jump out
of the initialized local optimum, and the initialization
appears to be "forgotten".

Fortunately, our method has achieved very good
accuracy even without fine-tuning as we will show
by experiments. With our approximated model as the
initialization, the fine-tuning with a sufficiently small
learning rate is able to further improve the results. In
our experiments, we use a learning rate of 1e-5 and a
mini-batch size of 128, and fine-tune the models for 5
epochs on the ImageNet training data.

We note that in the following the results are without
fine-tuning unless specified.

4 Experiments 

We comprehensively evaluate our method on two
models. The first model is a 10-layer model of "SPPnet
(OverFeat-7)" in [7], which we denote as "SPP-10".
This model (detailed in Table 1) has a similar architecture
to the OverFeat model [6] but is deeper. It has
7 conv layers and 3 fc layers. The second model is
the publicly available VGG-16 model [1] (footnote 1) that has 13
conv layers and 3 fc layers. SPP-10 won the 3rd place
and VGG-16 won the 2nd place in ILSVRC 2014 [19].

1. www.robots.ox.ac.uk/~vgg/research/very_deep/

We evaluate the "top-5 error" using single-view
testing. The view is the center 224 x 224 region cropped
from the resized image whose shorter side is 256.
The single-view error rate of SPP-10 is 12.51% on the
ImageNet validation set, and that of VGG-16 is 10.09% in our
testing (which is consistent with the number reported
by [1] (footnote 2)). These numbers serve as the references for the
increased error rates of our approximated models.

4.1 Experiments with SPP-10 

We first evaluate the effect of each step of our method on the
SPP-10 model by a series of controlled experiments.
Unless specified, we do not use the 3-d decomposition.

Single-Layer: Linear vs. Nonlinear

In this subsection we evaluate the single-layer performance.
When evaluating a single approximated
layer, the remaining layers are unchanged and not approximated.
The speedup ratio (involving that single
layer only) is shown as the theoretical ratio computed
by the complexity.

In Fig. 4 we compare the performance of our linear
solution (4) and nonlinear solution (6). The performance
is displayed as the increase of error rates (decrease
of accuracy) vs. the speedup ratio of that layer. Fig. 4
shows that the nonlinear solution consistently performs
better than the linear solution. In Table 1, we
show the sparsity (the portion of zero activations
after ReLU) of each layer. A zero activation is due
to the truncation of ReLU. The sparsity is over 60%
for Conv2-7, indicating that the ReLU takes effect
on a substantial portion of activations. This explains
the discrepancy between the linear and nonlinear
solutions. In particular, the Conv7 layer has a sparsity
of 95%, so the advantage of the nonlinear solution is
more obvious.

2. http://www.vlfeat.org/matconvnet/pretrained/

Figure 5: Symmetric vs. Asymmetric for SPP-10: the cases of 2-layer and 3-layer approximation. The speedup is
computed by the complexity of the layers approximated. (a) Approximation of Conv6 & 7. (b) Approximation
of Conv2, 3 & 4. (c) Approximation of Conv5, 6 & 7.

Fig. 4 also shows that when accelerating only a
single layer by 2x, the increased error rates of our
solutions are rather marginal or negligible. For the
Conv2 layer, the error rate is increased by < 0.1%;
for the Conv3-7 layers, the error rate is increased by
« 0.2%. 

We also notice that for Conv1, the degradation is 
negligible near 2x speedup (1.8x corresponds to d' = 
32). This can be explained by Fig. 2(a): the PCA energy 
has little loss when d' > 32. But the degradation can 
grow quickly for larger speedup ratios, because in this 
layer the channel number c = 3 is small and d' needs 
to be reduced drastically to achieve the speedup ratio. 
So in the following whole-model experiments of SPP-10, 
we will use d' = 32 for Conv1. 
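The relation between d' and the single-layer speedup can be checked with a short calculation. The sketch below assumes the low-rank decomposition introduced earlier in the paper (d' filters of size k x k x c followed by d filters of size 1 x 1 x d'), and it assumes a Conv1 shape of 7x7 filters with d = 96 outputs; only c = 3 is stated in this section, so the other two values are assumptions. The function name is hypothetical.

```python
def decomposed_speedup(k: int, c: int, d: int, d_prime: int) -> float:
    """Theoretical speedup of replacing a k x k layer (c channels, d filters)
    by d_prime k x k filters followed by d 1 x 1 filters."""
    original = d * k * k * c
    approximated = d_prime * k * k * c + d * d_prime
    return original / approximated


# Assumed Conv1 shape for SPP-10: k = 7, c = 3, d = 96 (k and d are assumptions).
conv1_speedup = decomposed_speedup(k=7, c=3, d=96, d_prime=32)
print(f"d' = 32: {conv1_speedup:.2f}x")  # d' = 32: 1.81x

# Smaller d' values show how quickly the rank must shrink for larger speedups.
for d_prime in (16, 8):
    print(f"d' = {d_prime}: {decomposed_speedup(7, 3, 96, d_prime):.2f}x")
```

Because k*k*c is only 147 for a 3-channel input, the 1 x 1 layer is a large share of the decomposed cost, so d' must drop well below 32 before the ratio approaches 4x.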

Multi-Layer: Symmetric vs. Asymmetric 

Next we evaluate the performance of asymmetric 
reconstruction as in problem (12). We demonstrate 
approximating 2 layers or 3 layers. In the case of 2 
layers, we show the results of approximating Conv6 
and 7; and in the case of 3 layers, we show the 
results of approximating Conv5-7 or Conv2-4. The 
comparisons are consistently observed for other 
multi-layer cases. 

We sequentially approximate the layers involved, 
from a shallower one to a deeper one. In the asymmetric 
version (12), x̂ is from the output of the previous 
approximated layer (if any), and x is from the 
output of the previous non-approximated layer. In the 
symmetric version (5), we use x for both terms. We 
have also tried another symmetric version that uses x̂ 
for both terms, and found that this symmetric version is 
even worse. 
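The difference between the two versions can be made concrete with a small numerical sketch. Problems (5) and (12) are defined earlier in the paper and are not reproduced in this section, so the objective form used here, the squared difference between r(W x) and r(M W x_in + b) with r the ReLU, is our reading of them and should be treated as an assumption. All names and the toy data are hypothetical.

```python
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(z, 0.0)


def reconstruction_error(W, M, b, x_target, x_input) -> float:
    """Squared error between the original nonlinear response r(W x_target)
    and the approximated response r(M W x_input + b)."""
    target = relu(W @ x_target)
    approx = relu(M @ (W @ x_input) + b[:, None])
    return float(np.sum((target - approx) ** 2))


rng = np.random.default_rng(0)
c, d, n = 8, 6, 100
W = rng.normal(size=(d, c))                 # original filters (toy)
x = rng.normal(size=(c, n))                 # inputs from non-approximated layers
x_hat = x + 0.1 * rng.normal(size=(c, n))   # inputs after earlier approximation

# With M = I and b = 0 the layer itself is left exact.
M, b = np.eye(d), np.zeros(d)

symmetric = reconstruction_error(W, M, b, x_target=x, x_input=x)
asymmetric = reconstruction_error(W, M, b, x_target=x, x_input=x_hat)
print(symmetric, asymmetric)
```

Even with an exact layer, the asymmetric error is nonzero because it measures the deviation inherited from earlier approximated layers. A solver that minimizes this quantity over M and b can therefore compensate for that inherited error, whereas the symmetric objective never sees it.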

Fig. 5 shows the comparisons between the symmetric 
and asymmetric versions. The asymmetric solution 
has significant improvement over the symmetric 
solution. For example, when only 3 layers are 
approximated simultaneously (like Fig. 5 (c)), the 
improvement is over 1.0% when the speedup is 4x. 
This indicates that the accumulative error rate due to 
multi-layer approximation can be effectively reduced 
by the asymmetric version. 

When more layers, or all layers, are approximated 
simultaneously (as below), the error rates increase even 
more drastically without the asymmetric solution. 

Whole-Model: with/without Rank Selection 

In Table 2 we show the results of whole-model 
acceleration. The solver is the asymmetric version. 
For Conv1, we fix d' = 32. For other layers, when 
the rank selection is not used, we adopt the same 
speedup ratio on each layer and determine its desired 
rank d' accordingly. When the rank selection is used, 
we apply it to select d' for Conv2-7. Table 2 shows 
that the rank selection consistently outperforms the 
counterpart without rank selection. The advantage of 
rank selection is observed in both linear and nonlinear 
solutions. 

In Table 2 we notice that the rank selection often 
chooses a higher rank d' (than the no rank selection) 
in Conv5-7. For example, when the speedup is 3x, 
the rank selection assigns d' = 167 to Conv7, while 
this layer only requires d' = 153 to achieve a 3x 
single-layer speedup by itself. This can be explained by 
Fig. 2(c). The energy of Conv5-7 is less concentrated, 
so these layers require higher ranks to achieve good 
approximations. 
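The link between energy concentration and the selected rank can be illustrated with a toy greedy procedure. The actual rank selection is defined earlier in the paper and is not reproduced in this section; the sketch below is only one plausible greedy scheme (repeatedly drop the eigenvalue whose removal loses the least relative energy per unit of complexity saved), and it is not necessarily the procedure used to produce Table 2. The function name, the spectra, and the costs are all hypothetical.

```python
import numpy as np


def greedy_rank_selection(eigenvalues, costs, budget):
    """Pick a rank per layer so the total cost fits the budget.

    eigenvalues: list of 1-D arrays, each sorted in descending order.
    costs: full-rank complexity of each layer.
    budget: target total complexity; a rank-r layer costs cost * r / full_rank.
    """
    ranks = [len(e) for e in eigenvalues]

    def total_cost():
        return sum(cost * r / len(e)
                   for cost, r, e in zip(costs, ranks, eigenvalues))

    while total_cost() > budget:
        best_layer, best_score = None, None
        for layer, (e, cost) in enumerate(zip(eigenvalues, costs)):
            r = ranks[layer]
            if r <= 1:
                continue  # keep at least one component per layer
            energy_loss = e[r - 1] / np.sum(e[:r])  # relative energy dropped
            cost_saving = cost / len(e)             # complexity saved
            score = energy_loss / cost_saving
            if best_score is None or score < best_score:
                best_layer, best_score = layer, score
        if best_layer is None:
            break
        ranks[best_layer] -= 1
    return ranks


# Two toy layers with equal cost: one concentrated spectrum, one flat spectrum.
concentrated = 3.0 ** -np.arange(8)
flat = np.ones(8)
ranks = greedy_rank_selection([concentrated, flat], costs=[1.0, 1.0], budget=1.0)
print(ranks)
```

Under a budget of half the original cost, the layer with the flat (less concentrated) spectrum keeps the higher rank, which mirrors the behaviour reported for Conv5-7.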

As we will show, the rank selection is more prominent 
for VGG-16 because of its diversity of layers. 

speedup  rank sel.  Conv1  Conv2  Conv3  Conv4  Conv5  Conv6  Conv7  err. ↑ (%) 
2x       no         32     110    199    219    219    219    219    1.18 
2x       yes        32     83     182    211    239    237    253    0.93 
2.4x     no         32     96     174    191    191    191    191    1.77 
2.4x     yes        32     74     162    187    207    205    219    1.35 
3x       no         32     77     139    153    153    153    153    2.56 
3x       yes        32     62     138    149    166    162    167    2.34 
4x       no         32     57     104    115    115    115    115    4.32 
4x       yes        32     50     112    114    122    117    119    4.20 
5x       no         32     46     83     92     92     92     92     6.53 
5x       yes        32     41     94     93     98     92     90     6.47 

Table 2: Whole-model acceleration with/without rank selection for SPP-10. The solver is the asymmetric 
version. The speedup ratios shown here involve all convolutional layers (Conv1-Conv7). We fix d' = 32 in 
Conv1. In the case of no rank selection, the speedup ratio of each other layer is the same. Each column of 
Conv1-7 shows the rank d' used, which is the number of filters after approximation. The error rates are top-5 
single-view, and shown as the increase of error rates compared with no approximation. 

Comparisons with Jaderberg et al.'s method [17] 

We compare with Jaderberg et al.'s method [17], 
which is a recent state-of-the-art solution to efficient 
evaluation. Although our decomposition shares some 
high-level motivations with [17], we point out that our 
optimization strategy is different from that of [17] and is 
important for accuracy, especially for very deep models 
that previous acceleration methods rarely addressed. 

Figure 6: Comparisons with Jaderberg et al.'s spatial 
decomposition method [17] for SPP-10. The speedup 
ratios are theoretical speedups of the whole model. 
The error rates are top-5 single-view, and shown as 
the increase of error rates compared with no 
approximation (smaller is better). 

Jaderberg et al.'s method [17] decomposes a k x k 
spatial support into a cascade of k x 1 and 1 x k 
spatial supports. A channel-dimension reduction is 
also considered. Their optimization method focuses 
on the linear reconstruction error. In [17], 
their method is only evaluated on a single layer of an 
OverFeat network [6] for ImageNet. 
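The complexity of the spatial decomposition described above can be sketched as follows. The sketch assumes that the k x 1 layer has d' filters over c input channels and that the 1 x k layer has d filters over d' channels; the function name is hypothetical. The first-layer shape used in the example (7x7 filters, 3 channels, 96 filters) is an assumption, chosen because it is consistent with the footnote's remark that such a layer can keep only about 5 filters to approach a 4x speedup.

```python
def spatial_decomposition_speedup(k: int, c: int, d: int, d_prime: int) -> float:
    """Theoretical speedup of replacing a k x k layer (c channels, d filters)
    by a k x 1 layer with d_prime filters and a 1 x k layer with d filters."""
    original = d * k * k * c
    decomposed = d_prime * k * c + d * k * d_prime
    return original / decomposed


# Assumed first layer: 7x7 filters, c = 3 input channels, d = 96 filters.
first_layer = spatial_decomposition_speedup(k=7, c=3, d=96, d_prime=5)
print(f"first layer, d' = 5: {first_layer:.2f}x")

# A 3x3 layer with 512 channels and 512 filters has far more room.
deep_layer = spatial_decomposition_speedup(k=3, c=512, d=512, d_prime=192)
print(f"3x3x512 layer, d' = 192: {deep_layer:.2f}x")
```

With only 3 input channels, nearly all of the decomposed cost sits in the second (1 x k) layer, so the intermediate width d' has to be very small to reach a given ratio.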

Our comparisons are based on our implementation 
of [17]. We use the Scheme 2 decomposition in [17] 
and its "filter reconstruction" version (as we explain 
below), which is used for ImageNet as in [17]. Our 
reproduction of the filter reconstruction in [17] gives 
a 2x single-layer speedup on Conv2 of SPP-10 with a 
0.2% increase of error. As a reference, [17] reports 
a 0.5% increase of error on Conv2 under a 2x 
single-layer speedup, evaluated on another OverFeat 
network [6] similar to SPP-10. 

It is worth discussing our implementation of Jaderberg 
et al.'s [17] "data reconstruction" scheme, for which 
SGD and backpropagation were suggested for 
optimization. In our reproduction, we find that data 
reconstruction works well for the character classification 
task studied in [17]. However, we find it nontrivial 
to make data reconstruction work for large models 
trained for ImageNet. We observe that the learning 
rate needs to be carefully chosen for the SGD-based 
data reconstruction to converge (as also reported 
independently in [18] for another decomposition), 
and when the training starts to converge, the results 
are still sensitive to the initialization (for which we 
have tried Gaussian distributions with a wide range 
of variances). We conjecture that this is because the 
ImageNet dataset and models are more complicated, 
and using SGD to regress a single layer may be 
sensitive to multiple local optima. In fact, Jaderberg 
et al. [17] only report "filter reconstruction" results 
of a single layer on ImageNet. For these reasons, 
our implementation of Jaderberg et al.'s method on 
ImageNet models is based on filter reconstruction. 
We believe that these issues have not been settled and 
need to be investigated further, and that accelerating deep 
networks involves not just the decomposition but also 
the way of optimization. 

model         speedup solution                   top-5 err. (1-view)  CPU (ms)     GPU (ms) 
SPP-10 [7]    -                                  12.5                 930          7.67 
SPP-10 (4x)   Jaderberg et al. [17] (our impl.)  18.5                 278 (3.3x)   2.41 (3.2x) 
SPP-10 (4x)   our asym.                          16.7                 271 (3.4x)   2.62 (2.9x) 
SPP-10 (4x)   our asym. (3d)                     14.1                 267 (3.5x)   2.32 (3.3x) 
SPP-10 (4x)   our asym. (3d) FT                  13.8                 267 (3.5x)   2.32 (3.3x) 
AlexNet [4]   -                                  18.8                 273          2.37 

Table 3: Comparisons of absolute performance of SPP-10. The top-5 error is the absolute value. The running 
time is a single view on a CPU (single thread, with SSE) or a GPU. The accelerated models are those of 4x 
theoretical speedup (Fig. 6). In the brackets are the actual speedup ratios. 

In Fig. 6 we compare our method with Jaderberg et 
al.'s [17] for whole-model speedup. For whole-model 
speedup of [17], we implement their method sequentially 
on Conv2-7 using the same speedup ratio (see footnote 3). The 
speedup ratios are the theoretical complexity ratios 
involving all convolutional layers. Our method is the 
asymmetric version with rank selection. Fig. 6 
shows that when the speedup ratios are large (4x 
and 5x), our method outperforms Jaderberg et al.'s 
method significantly. For example, when the speedup 
ratio is 4x, the increased error rate of our method 
is 4.2%, while Jaderberg et al.'s is 6.0%. Jaderberg 
et al.'s result degrades quickly as the speedup 
ratio grows, while ours degrades slowly. 
This suggests that our method is effective at reducing 
the accumulated error. 

We further compare with our asymmetric version 
using 3d decomposition (Sec. 3.5). In Fig. 6 we show 
the results as "asymmetric (3d)". Fig. 6 shows that this 
strategy leads to a significantly smaller increase of error. 
For example, when the speedup is 5x, the error 
is increased by only 2.5%. Our asymmetric solver 
effectively controls the accumulative error even if 
the multiple layers are decomposed extensively, and 
the 3d decomposition makes it easier to achieve a given 
speedup ratio. 

For completeness, we also evaluate our approximation 
method on the character classification model released 
by [17]. Our asymmetric (3d) solution achieves a 
4.5x speedup with only a drop of 0.7% in classification 
accuracy, which is better than the 1% drop for the 
same speedup reported by [17]. 

Comparisons with Training from Scratch 

The architecture of the approximated model can 
also be trained "from scratch" on the ImageNet dataset. 
One hypothesis is that the underlying architecture is 
sufficiently powerful, and the acceleration algorithm 
might not be necessary. We show that this hypothesis 
is premature. 

3. We do not apply Jaderberg et al.'s method [17] on Conv1, 
because this layer has a small number of input channels (3), and 
the first k x 1 decomposed layer can only have a very small number 
of filters (e.g., 5) to approach a speedup ratio (e.g., 4x). Also note 
that the speedup ratio is about all conv layers, and because Conv1 
is not accelerated, other layers will have a slightly larger speedup. 

We directly train a model of the same architecture 
as the decomposed model. The decomposed model is 
much deeper than the original model (each layer is 
replaced by three layers), so we adopt the initialization 
method in [38]; otherwise training does not converge easily. We 
train the model for 100 epochs. We follow the common 
practice in [39], [7] of training ImageNet models. 

The comparisons are in Table 4. The accuracy of the 
model trained from scratch is worse than that of our 
accelerated model by a considerable margin (2.8%). 
These results indicate that the accelerating algorithms 
can effectively digest information from the trained 
models. They also suggest that the models trained 
from scratch have much redundancy. 


model               top-5 err. (1-view)   increased err. (1-view) 
SPP-10 [7]          12.5                  - 
our asym. 3d (4x)   14.1                  1.6 
from scratch        16.9                  4.4 


Table 4: Comparisons with the same decomposed 
architecture trained from scratch. 

Comparisons of Absolute Performance 

Table 3 shows the comparisons of the absolute 
performance of the accelerated models. We also 
evaluate AlexNet [4], which is about as fast as our 
accelerated 4x models. The comparison is based on 
our re-implementation of AlexNet. Our AlexNet is the 
same as in [4] except that the GPU splitting is ignored. 
Our re-implementation of this model has a top-5 
single-view error rate of 18.8% (10-view top-5 16.0% and 
top-1 37.6%). This is better than the one reported in [4] 
(see footnote 4). 

The models accelerated by our asymmetric (3d) 
version have 14.1% and 13.8% top-5 error, without 
and with fine-tuning. This means that the accelerated 
model has 5.0% lower error than AlexNet, while its 
speed is nearly the same as AlexNet. 

4. In [4] the 10-view error is top-5 18.2% and top-1 40.7%. 

layer     filter size  # channels  # filters  stride  output size  complexity (%)  # of zeros 
Conv1_1   3x3          3           64         1       224 x 224    0.6             0.48 
Conv1_2   3x3          64          64         1       224 x 224    12.0            0.32 
Pool1     2x2                                 2       112 x 112 
Conv2_1   3x3          64          128        1       112 x 112    6.0             0.35 
Conv2_2   3x3          128         128        1       112 x 112    12.0            0.52 
Pool2     2x2                                 2       56 x 56 
Conv3_1   3x3          128         256        1       56 x 56      6.0             0.48 
Conv3_2   3x3          256         256        1       56 x 56      12.1            0.48 
Conv3_3   3x3          256         256        1       56 x 56      12.1            0.70 
Pool3     2x2                                 2       28 x 28 
Conv4_1   3x3          256         512        1       28 x 28      6.0             0.65 
Conv4_2   3x3          512         512        1       28 x 28      12.1            0.70 
Conv4_3   3x3          512         512        1       28 x 28      12.1            0.87 
Pool4     2x2                                 2       14 x 14 
Conv5_1   3x3          512         512        1       14 x 14      3.0             0.76 
Conv5_2   3x3          512         512        1       14 x 14      3.0             0.80 
Conv5_3   3x3          512         512        1       14 x 14      3.0             0.93 

Table 5: The architecture of the VGG-16 model [1]. It has 13 conv layers and 3 fc layers. The column 
"complexity" is the theoretical time complexity, shown as relative numbers to the total convolutional 
complexity. The column "# of zeros" is the relative portion of zero responses, which shows the "sparsity" 
of the layer. 

speedup  rank sel.  C1_1  C1_2  C2_1  C2_2  C3_1  C3_2  C3_3  C4_1  C4_2  C4_3  C5_1  C5_2  C5_3  err. ↑ (%) 
2x       no         64    28    52    57    104   115   115   209   230   230   230   230   230   0.99 
2x       yes        64    18    41    50    94    96    116   207   213   260   467   455   442   0.28 
3x       no         64    19    34    38    69    76    76    139   153   153   153   153   153   3.25 
3x       yes        64    15    31    34    68    64    75    134   126   146   312   307   294   1.66 
4x       no         64    14    26    28    52    57    57    104   115   115   115   115   115   6.38 
4x       yes        64    11    25    28    52    46    56    104   92    100   232   224   214   3.84 

Table 6: Whole-model acceleration with/without rank selection for VGG-16. The solver is the asymmetric 
version. The speedup ratios shown here involve all convolutional layers. We do not accelerate Conv1_1. In the 
case of no rank selection, the speedup ratio of each other layer is the same. Each column of C1_2-C5_3 shows 
the rank d' used, which is the number of filters after approximation. The error rates are top-5 single-view, and 
shown as the increase of error rates compared with no approximation. 

Table 3 also shows the actual running time per view, 
on a C++ implementation and an Intel i7 CPU (2.9GHz) 
or an Nvidia K40 GPU. In our CPU version, our method 
has actual speedup ratios (3.5x) close to the theoretical 
speedup ratios (4.0x). The gap mainly comes 
from the fc and other layers. In our GPU version, the 
actual speedup ratio is about 3.3x. An accelerated 
model is harder to parallelize on a GPU, so the 
actual ratio is lower. 
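The bracketed ratios in Table 3 follow directly from the measured running times: the baseline time divided by the accelerated time. A minimal check in Python, using the SPP-10 and "our asym. (3d)" timings from Table 3 (the function name is hypothetical):

```python
def actual_speedup(baseline_ms: float, accelerated_ms: float) -> float:
    """Measured speedup: baseline running time over accelerated running time."""
    return baseline_ms / accelerated_ms


# Per-view running times (ms) taken from Table 3.
cpu = actual_speedup(930, 267)     # SPP-10 vs. our asym. (3d), CPU
gpu = actual_speedup(7.67, 2.32)   # SPP-10 vs. our asym. (3d), GPU
print(f"CPU {cpu:.1f}x, GPU {gpu:.1f}x")  # CPU 3.5x, GPU 3.3x
```

Both values fall short of the 4x theoretical ratio, which only counts the convolutional layers.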

4.2 Experiments with VGG-16 

The very deep VGG models [1] have substantially 
improved a wide range of visual recognition tasks, 
including object detection [9], [2], [10], [11], semantic 
segmentation [12], [13], [14], [40], [41], image 
captioning [42], [43], [44], video/action recognition [45], 
image question answering [46], texture recognition 
[47], etc. Considering the big impact yet slow speed 
of this model, we believe it is of practical significance 
to accelerate it. 

Accelerating VGG-16 for ImageNet Classification 

Firstly we discover that our whole-model rank 
selection is particularly important for accelerating 
VGG-16. In Table 6 we show the results without/with 
rank selection. No 3d decomposition is used in this 
comparison. For a 4x speedup, the rank selection 
reduces the increased error from 6.38% to 3.84%. This 
is because of the greater diversity of layers in VGG-16 
(Table 5). Unlike SPP-10 (or other shallower models 
[4], [5]) that repeatedly applies 3x3 filters on the 
same feature map size, the VGG-16 model applies 
them more evenly on five feature map sizes (224, 
112, 56, 28, and 14). Besides, as the filter numbers in 
Conv5_1-5_3 are not increased, the time complexity of 
Conv5_1-5_3 is smaller than others. The selected ranks 
d' in Table 6 show their adaptivity: e.g., the layers 
Conv5_1 to Conv5_3 keep more filters, because they 
have small time complexity and it is not a good 
trade-off to reduce them aggressively. The whole-model rank 
selection is a key to maintaining high accuracy when 
accelerating VGG-16. 

                                    increase of top-5 error (1-view) 
speedup ratio                       3x      4x      5x 
Jaderberg et al. [17] (our impl.)   2.3     9.7     29.7 
our asym. (3d)                      0.4     0.9     2.0 
our asym. (3d) FT                   0.0     0.3     1.0 

Table 7: Accelerating the VGG-16 model [1] using a speedup ratio of 3x, 4x, or 5x. The top-5 error rate 
(1-view) of the VGG-16 model is 10.1%. This table shows the increase of error on this baseline. 

model         speedup solution                   top-5 error (1-view)  CPU (ms)     GPU (ms) 
VGG-16 [1]    -                                  10.1                  3287         18.60 
VGG-16 (4x)   Jaderberg et al. [17] (our impl.)  19.8                  875 (3.8x)   6.40 (2.9x) 
VGG-16 (4x)   our asym.                          13.9                  875 (3.8x)   7.97 (2.3x) 
VGG-16 (4x)   our asym. (3d)                     11.0                  860 (3.8x)   6.30 (3.0x) 
VGG-16 (4x)   our asym. (3d) FT                  10.4                  858 (3.8x)   6.39 (2.9x) 

Table 8: Absolute performance of accelerating the VGG-16 model [1]. The top-5 error is the absolute value. 
The running time is a single view on a CPU (single thread, with SSE) or a GPU. The accelerated models are 
those of 4x theoretical speedup (Table 7). In the brackets are the actual speedup ratios. 

In Table 7 we evaluate our method on VGG-16 
for ImageNet classification. Here we evaluate our 
asymmetric 3d version (without or with fine-tuning). 
We evaluate challenging speedup ratios of 3x, 4x and 
5x. The ratios are those of the theoretical speedups of 
all 13 conv layers. 
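The theoretical complexity of the 13 conv layers can be computed from the layer shapes in Table 5. The sketch below is an illustration under two assumptions: the cost of a layer is taken as d * k^2 * c per output position, and a layer approximated with rank d' costs d' * k^2 * c + d * d' per position (the low-rank decomposition without the 3d variant). Function names are hypothetical. Under these assumptions the code gives complexity shares close to Table 5 and the "no rank selection" ranks of Table 6.

```python
# (name, input channels c, filters d, output side) for the 3x3 conv layers of Table 5.
VGG16_CONV = [
    ("Conv1_1", 3, 64, 224), ("Conv1_2", 64, 64, 224),
    ("Conv2_1", 64, 128, 112), ("Conv2_2", 128, 128, 112),
    ("Conv3_1", 128, 256, 56), ("Conv3_2", 256, 256, 56), ("Conv3_3", 256, 256, 56),
    ("Conv4_1", 256, 512, 28), ("Conv4_2", 512, 512, 28), ("Conv4_3", 512, 512, 28),
    ("Conv5_1", 512, 512, 14), ("Conv5_2", 512, 512, 14), ("Conv5_3", 512, 512, 14),
]
K = 3  # filter size


def layer_cost(c: int, d: int, side: int) -> int:
    """Multiply-accumulates of a K x K layer over a side x side output map."""
    return d * K * K * c * side * side


def complexity_shares() -> dict:
    """Per-layer share (%) of the total convolutional complexity."""
    costs = {name: layer_cost(c, d, s) for name, c, d, s in VGG16_CONV}
    total = sum(costs.values())
    return {name: 100.0 * cost / total for name, cost in costs.items()}


def uniform_rank(c: int, d: int, speedup: float) -> int:
    """Largest d' whose single-layer speedup is at least `speedup`."""
    return int(d * K * K * c / (speedup * (K * K * c + d)))


def whole_model_speedup(speedup: float) -> float:
    """Conv-only speedup when every layer except Conv1_1 uses uniform_rank."""
    original = sum(layer_cost(c, d, s) for _, c, d, s in VGG16_CONV)
    _, c0, d0, s0 = VGG16_CONV[0]
    accelerated = layer_cost(c0, d0, s0)  # Conv1_1 is left untouched
    for _, c, d, s in VGG16_CONV[1:]:
        r = uniform_rank(c, d, speedup)
        accelerated += (r * K * K * c + d * r) * s * s
    return original / accelerated


shares = complexity_shares()
ranks_2x = [uniform_rank(c, d, 2.0) for _, c, d, _ in VGG16_CONV[1:]]
ranks_4x = [uniform_rank(c, d, 4.0) for _, c, d, _ in VGG16_CONV[1:]]
print(ranks_2x)
print(ranks_4x)
print(round(whole_model_speedup(2.0), 2))
```

The three Conv5 layers each account for only about 3% of the total, so shrinking them saves little time. This matches the observation that rank selection keeps many more filters in Conv5_1 to Conv5_3 than the uniform setting does.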

Somewhat surprisingly, our method has demonstrated 
compelling results for this very deep model, 
even without fine-tuning. Our no-fine-tuning model 
has a 0.9% increase of 1-view top-5 error for a speedup 
ratio of 4x. On the contrary, the previous method [17] 
suffers greatly from the increased depth because of the 
rapidly accumulated error of multiple approximated 
layers. After fine-tuning, our model has a 0.3% 
increase of 1-view top-5 error for a 4x speedup. This 
degradation is even lower than that of the shallower 
model of SPP-10. This suggests that the information 
in the very deep VGG-16 model is highly redundant, 
and our method is able to effectively digest it. 

Fig. 7 shows the actual vs. theoretical speedup ratios 
of VGG-16 using CPU and GPU implementations. The 
CPU speedup ratios are very close to the theoretical 
ratios. The GPU implementation, which is based on 
the standard Caffe library [48], exhibits a gap between 
actual vs. theoretical ratios (as is also witnessed in 
[49]). GPU speedup ratios are more sensitive to 
specialized implementation, and the generic Caffe kernels 
are not optimized for some layers (e.g., 1x1, 1x3, and 
3x1 convolutions). We believe that a more specially 
engineered implementation will increase the actual 
GPU speedup ratio. 

Figurnov et al.'s work [49] is one of the few existing 
works that present results of accelerating the whole 
model of VGG-16. They report increased top-5 1-view 
error rates of 3.4% and 7.1% for actual CPU 
speedups of 3x and 4x (for 4x theoretical speedup 
they report a 3.8x actual CPU speedup). Thus our 
method is substantially more accurate than theirs. 
Note that results in [49] are after fine-tuning. This 
suggests that fine-tuning is not sufficient for 
whole-model acceleration; a good optimization solver for the 
decomposition is needed. 

Figure 7: Actual vs. theoretical speedup ratios of VGG-16 
using CPU and GPU implementations. 

conv speedup   mAP    ΔmAP 
baseline       66.9   - 
3x             66.9   0.0 
4x             66.1   -0.8 
5x             65.2   -1.7 

Table 9: Object detection mAP on the PASCAL VOC 
2007 test set. The detector is Fast R-CNN [2] using the 
pre-trained VGG-16 model. 

Accelerating VGG-16 for Object Detection 

Current state-of-the-art object detection results [9], 
[2], [10], [11] mostly rely on the VGG-16 model. We 
evaluate our accelerated VGG-16 models for object 
detection. Our method is based on the recent Fast 
R-CNN [2]. 

We evaluate on the PASCAL VOC 2007 object 
detection benchmark [20]. This dataset contains 5k trainval 
images and 5k test images. We follow the default 
setting of Fast R-CNN using the publicly released 
code (see footnote 5). We train Fast R-CNN on the trainval set and 
evaluate on the test set. The accuracy is evaluated by 
mean Average Precision (mAP). 

In our experiments, we first approximate the VGG-16 
model on the ImageNet classification task. Then we 
use the approximated model as the pre-trained model 
for Fast R-CNN. We use our asymmetric 3d version 
with fine-tuning. Note that unlike image classification, 
where the conv layers dominate running time, for Fast 
R-CNN detection the conv layers consume about 70% of the 
actual running time [2]. The reported speedup ratios 
are the theoretical speedups of the conv layers 
only. 
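To see what a conv-only speedup means for the whole detector, the calculation below applies Amdahl's law to the 70% figure quoted above. It is a rough estimate under two assumptions that the text does not guarantee: the remaining 30% of the running time is unchanged, and the actual conv speedup equals the theoretical one (the GPU measurements reported earlier show that it can be lower). The function name is hypothetical.

```python
def end_to_end_speedup(conv_fraction: float, conv_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the conv share is accelerated."""
    return 1.0 / ((1.0 - conv_fraction) + conv_fraction / conv_speedup)


# Conv layers take about 70% of Fast R-CNN's running time (figure from the text).
for conv_speedup in (3, 4, 5):
    overall = end_to_end_speedup(0.70, conv_speedup)
    print(f"conv {conv_speedup}x -> overall {overall:.2f}x")
```

Under these assumptions a 4x conv speedup corresponds to roughly a 2.1x end-to-end speedup, and the overall ratio can never exceed 1 / 0.30, about 3.3x, however fast the conv layers become.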

Table 9 shows the results of the accelerated models 
in PASCAL VOC 2007 detection. Our method with a 
4x convolution speedup has a graceful degradation 
of 0.8% in mAP. We believe this trade-off between 
accuracy and speed is of practical importance, because 
even with the recent advance of fast object detection 
[7], [2], the feature extraction running time is still 
considerable. 

5 Conclusion 

We have presented an acceleration method for very 
deep networks. Our method is evaluated under 
whole-model speedup ratios. It can effectively reduce 
the accumulated error of multiple layers thanks to 
the nonlinear asymmetric reconstruction. Competitive 
speedups and accuracy are demonstrated in the complex 
ImageNet classification task and PASCAL VOC 
object detection task. 

5. https://github.com/rbgirshick/fast-rcnn 